Regular expressions are a tool for matching text by looking for a pattern, rather than for an exact text string. For example, you could check for the presence of an exact text string within another text string simply by using the Python in keyword, as shown here:
>>> haystack = 'My phone number is 213-867-5309.'
>>> '213-867-5309' in haystack
True
Sometimes, however, you do not have the exact text you want to match. For example, what if you want to know whether any valid phone number is present in a string? To take that one step further, what if you want to know whether any valid phone number is present in the string, and also want to know what that phone number is?
This is where regular expressions are useful. Their purpose is to specify a pattern of text to identify within a bigger text string. Regular expressions can identify the presence or absence of text matching the pattern, and also split a pattern into one or more subpatterns, delivering the specific text within each.
This chapter explores regular expressions (or regexes, for short). First, you learn how to perform regular expression searches in Python using the re module. You then explore various regular expressions, beginning with the simple and working toward the more complex. Finally, you learn about regular expression substitution.
You use regular expressions for two common reasons.
The first reason is data mining—that is, when you want to find a pile of text (matching a given pattern) in a bigger pile of text. It is very common to need to identify text that looks like a given type of information (for example, an e-mail address, a URL, a phone number, or the like).
As humans, we identify the type of information being presented based on patterns all the time. A television commercial that shows alphanumeric characters ending in .com or .org is intuitively understood to be presenting a web address. Add an @ character, and it is intuitively understood to be an e-mail address instead.
The second reason is validation. You can use regular expressions to establish that you got the data that you expected. It is generally wise to consider “outside” data to be untrustworthy, especially data from users. Regular expressions can help determine whether or not untested data is valid.
The corollary to this is that regular expressions are valuable tools for coercing data into a consistent format. For example, a phone number can be written in multiple valid ways, and if you are asking for user input, you likely want to accept all of them. However, you really only want to store the actual digits of the phone number, which can then be consistently formatted on display. In addition to being useful for validation, regular expressions are useful for this kind of data coercion.
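As a sketch of this kind of coercion (the helper name and the leading-country-code policy here are illustrative assumptions, not part of the chapter; re.sub itself is covered at the end of this chapter), you could strip every non-digit before storing:

```python
import re

def normalize_phone(raw):
    """Reduce any phone number format to its bare digits (hypothetical helper)."""
    digits = re.sub(r'\D', '', raw)  # drop everything that is not a digit
    # Illustrative policy: discard a leading U.S. country code if present.
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]
    return digits

print(normalize_phone('(213) 867-5309'))   # 2138675309
print(normalize_phone('+1 213.867.5309'))  # 2138675309
```

However the user typed the number, the stored value is the same, and you can re-format it consistently on display.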
The Python standard library provides the re module for using regular expressions.
The primary function that the re module provides is search. Its purpose is to take a regular expression (the needle) and a string (the haystack), and return the first match found. If no match is found at all, re.search returns None.
Consider re.search in action with the simplest regular expression possible, which is a simple alphanumeric string.
>>> import re
>>> re.search(r'fox', 'The quick brown fox jumped...')
<_sre.SRE_Match object; span=(16, 19), match='fox'>
The regular expression parser's job here is quite simple. It finds the word fox within the string, and returns a match object.
Observant readers may note that the regular expression was specified slightly differently: r'fox'. The r character that precedes the string stands for "raw" (no, it does not stand for "regex").
The difference between a raw string and a regular string is simply that raw strings do not interpret the \ character as an escape character; every backslash you type stays in the string as a literal backslash.
However, raw strings are particularly useful for regular expressions because the regular expression engine itself needs the \ character for its own escaping at times. Therefore, using raw strings for regular expressions is very common and very useful. In fact, it is so common that some syntax-highlighting engines will actually provide regular-expression syntax highlighting within raw strings.
Match objects have several methods to tell you things about the match. The group method is arguably the most important. It returns a string with the text of the match, as shown here:
>>> match = re.search(r'fox', 'The quick brown fox jumped...')
>>> match.group()
'fox'
You may be curious why this method is named group. This is because regular expressions can be split into multiple subgroups that call out just a subsection of the match. You learn more about this shortly.
Match objects have several other methods. The start method provides the index in the original string where the match began, and the end method provides the index in the original string where the match ended.
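For instance, continuing the fox example, start and end give back the span that the match object displays:

```python
import re

haystack = 'The quick brown fox jumped...'
match = re.search(r'fox', haystack)
print(match.start())  # 16
print(match.end())    # 19
# Slicing the haystack with these indices recovers the matched text.
print(haystack[match.start():match.end()])  # fox
```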
The groups and groupdict methods are used to call out subsections of the regular expression. You learn more about these methods later, during a discussion about regular expressions with backreferences.
Finally, the re attribute contains the regular expression used in the match, the string attribute contains the string used as the haystack, and the pos attribute is set to the position in the string where the search began.
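A quick look at these attributes (note that match.re holds the compiled pattern object, so its pattern attribute contains the original text):

```python
import re

match = re.search(r'fox', 'The quick brown fox jumped...')
print(match.re.pattern)  # fox
print(match.string)      # The quick brown fox jumped...
print(match.pos)         # 0 -- re.search scanned from the very beginning
```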
A limitation of re.search is that it returns at most one match, in the form of a match object (discussed in more detail shortly). If multiple matches exist within the string, re.search will only return the first one. Often, this is exactly what you want. However, sometimes you want every match that exists.
The re module provides two functions for this purpose: findall and finditer. Both of these functions return all non-overlapping matches, including empty matches. The re.findall function returns a list, and re.finditer returns an iterator.
There is a key difference between them, however. The re.findall function does not return match objects. Instead, it returns simply the matched text itself, either as a string or a tuple of strings, depending on whether the regular expression contains groups. The re.finditer function, by contrast, yields a full match object for each match.
Consider an example of findall:
>>> import re
>>> re.findall(r'o', 'The quick brown fox jumped...')
['o', 'o']
In this case, it returns a list with two o characters, because the o character appears twice in the string.
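By contrast, finditer yields a match object for each occurrence, so positional information is preserved:

```python
import re

for m in re.finditer(r'o', 'The quick brown fox jumped...'):
    print(m.group(), m.start())  # each 'o' along with its index
```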
The simplest regular expression is one that contains plain alphanumeric characters—and nothing else. This is actually easy to overlook. Many regular expressions use direct text matching.
The string Python is a valid regular expression. It matches that word, and nothing else. Regular expressions, by default, are also case-sensitive, so it will not match python or PYTHON.
>>> re.search(r'Python', 'python')
>>> re.search(r'Python', 'PYTHON')
It will, however, match the word in a larger block of text. It will match the word in Python 3, or This is Python code, or the like, as shown here:
>>> re.search(r'Python', 'Python 3')
<_sre.SRE_Match object; span=(0, 6), match='Python'>
>>> re.search(r'Python', 'This is Python code.')
<_sre.SRE_Match object; span=(8, 14), match='Python'>
Of course, there is essentially no value in using regular expressions just to match plain text. After all, it would be trivially easy to use the in operator to test for the presence of a string within another string, and str.index is more than up to the task of telling you where in a larger string a substring occurs.
The power of regular expressions lies in their capability to specify patterns of text to be matched.
Character classes enable you to specify that a single character should match one of a set of possible characters, rather than just a single character. You can denote a character class by using square brackets and listing the possible characters within the brackets.
For example, consider a regular expression that should match either Python or python: [Pp]ython.
What is happening here? The first token in the regular expression is actually a character class with two options: P and p. Either character will match, but nothing else. The remaining five characters are just literal characters.
What does the following regular expression match?
>>> re.search(r'[Pp]ython', 'Python 3')
<_sre.SRE_Match object; span=(0, 6), match='Python'>
>>> re.search(r'[Pp]ython', 'python 3')
<_sre.SRE_Match object; span=(0, 6), match='python'>
This regular expression matches the word Python in the string Python 3 and the word python in the string python 3. It does not make the entire word case-insensitive, though. It does not match the word in all caps, for example.
>>> re.search(r'[Pp]ython', 'PYTHON 3')
>>>
Another use for this kind of character class is for words with multiple spellings. The regular expression gr[ae]y will match either gray or grey, allowing you to quickly identify and extract either spelling.
>>> re.search(r'gr[ae]y', 'gray')
<_sre.SRE_Match object; span=(0, 4), match='gray'>
It is also worth noting that character classes like this match one and exactly one character.
>>> re.search(r'gr[ae]y', 'graey')
>>>
Here, the regular expression engine successfully matches the literal g, then the literal r. Next, the engine is given the character class [ae], and matches it against the a. Now, the character class has been matched, and the engine moves on. The next character in the regular expression is a y, but the next character in the string is an e. This is not a match, so the regular expression parser moves on, starting over and looking for another starting g. When it gets to the end of the string and fails to find one, it returns None.
Some quite common character classes are very large. For example, consider trying to match any digit. It would be quite unwieldy to provide [0123456789] each time. It would be even more unwieldy to provide every letter, both capitalized and lowercase, each time.
To accommodate this, the regular expression engine uses the hyphen character (-) within character classes to denote ranges. A character class to match any digit could be written [0-9] instead. It is also possible to use more than one range within a character class, simply by providing the ranges next to one another. The [a-z] character class matches only lowercase letters, and the [A-Z] character class matches only capital letters. These can be combined: [A-Za-z] would match both lowercase and capital letters.
>>> re.search(r'[a-zA-Z]', 'x')
<_sre.SRE_Match object; span=(0, 1), match='x'>
>>> re.search(r'[a-zA-Z]', 'B')
<_sre.SRE_Match object; span=(0, 1), match='B'>
Of course, you may also want to match the literal hyphen character. This is surprisingly common. Many reasons exist to match (for example) alphanumeric characters, hyphen, and underscore. What happens when you want to do this?
You can escape the hyphen: [A-Za-z0-9\-_]. This will tell the regular expression engine that you want a literal hyphen. However, escaping generally makes things more difficult to read. You can also provide the hyphen as either the first or last character in the character class, as in [A-Za-z0-9_-]. In this case, the engine will interpret the character as a literal hyphen.
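A quick check that the trailing hyphen is treated literally (the sample strings here are illustrative):

```python
import re

# The hyphen at the end of the class is literal, so 'my-slug_name' matches whole.
print(re.findall(r'[A-Za-z0-9_-]+', 'my-slug_name v2'))  # ['my-slug_name', 'v2']
```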
The character classes shown thus far are all defined by what characters may occur. However, you may want to define a character class by what characters may not occur.
You can invert a character class (meaning that it will match any character other than those specified) by beginning the character class with a ^ character.
>>> re.search(r'[^a-z]', '4')
<_sre.SRE_Match object; span=(0, 1), match='4'>
>>> re.search(r'[^a-z]', '#')
<_sre.SRE_Match object; span=(0, 1), match='#'>
>>> re.search(r'[^a-z]', 'X')
<_sre.SRE_Match object; span=(0, 1), match='X'>
>>> re.search(r'[^a-z]', 'd')
>>>
In this scenario, the regular expression parser looks for literally any character other than a through z. Therefore, it matches against numbers, capital letters, and symbols, but not lowercase letters.
It is important to note specifically what the regular expression is looking for here. It is looking for the presence of a character that does not match any of the characters in the character class. It is not looking for (and will not match) the absence of a character.
Consider the regular expression n[^e]. This means the character n followed by any character that is not an e.
>>> re.search(r'n[^e]', 'final')
<_sre.SRE_Match object; span=(2, 4), match='na'>
In this case, it matches against the word final, and the match is na. The a character is part of the match, because it is a single character that is not an e.
The regular expression will fail to match if an n is followed by an e, as you expect.
>>> re.search(r'n[^e]', 'jasmine')
>>>
Here, the regular expression engine gets to the only n in the string but cannot match the next character, because it is an e, and thus there is no match.
However, the regular expression also will not match against an n at the end of the string.
>>> re.search(r'n[^e]', 'Python')
>>>
The regular expression finds the n in the word Python. However, that is as far as it gets. There is no character remaining in the string to match against [^e], and, therefore, the match fails.
Several common character classes also have predefined shortcuts within the regular expression engine. If you want to define "words," your instinct may be to use [A-Za-z]. However, many words use characters that fall outside of this range.
The regular expression engine provides a shortcut, \w, which matches "any word character." How "any word character" is defined varies somewhat based on your environment. In Python 3, it essentially matches word characters in any language. In Python 2, it matches only English word characters unless the re.UNICODE flag is set. In both cases, it also matches digits and the underscore (_), but not the hyphen.
The \d shortcut matches digit characters. In Python 3, it also matches digit characters in other languages. In Python 2, it matches only [0-9].
The \s shortcut matches whitespace characters, such as space, tab, newline, and so on. The exact list of whitespace characters is greater in Python 3 than in Python 2.
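Both shortcuts in action:

```python
import re

print(re.findall(r'\d', 'Python 3.6'))       # ['3', '6']
print(re.search(r'\s', 'Python 3').start())  # 6 -- the space between the words
```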
Finally, the \b shortcut matches a zero-length substring. However, it only matches it at the beginning or end of a word. This is called the word boundary shortcut.
>>> re.search(r'\bcorn\b', 'corn')
<_sre.SRE_Match object; span=(0, 4), match='corn'>
>>> re.search(r'\bcorn\b', 'corner')
>>>
The regular expression engine matches the word corn here when it is by itself, but fails to match it within corner, because the trailing \b does not match (the next character is e, which is a word character).
It is worth noting that these shortcuts work both within character classes and outside of them. For example, the regular expression \w will match any word character.
>>> re.search(r'\w', 'Python 3')
<_sre.SRE_Match object; span=(0, 1), match='P'>
Because re.search only returns the first match, it matches the P character and then completes. Consider the result of re.findall using the same regular expression and string.
>>> re.findall(r'\w', 'Python 3')
['P', 'y', 't', 'h', 'o', 'n', '3']
Note that the regular expression matches every character in the string except the space. The \w shortcut does include digits in the Python regular expression engine.
The \w, \d, and \s shortcuts also have negated counterparts: \W, \D, and \S. These match any character other than the characters matched by the corresponding lowercase shortcut. Note again that these still require a character to be present; they do not match an empty string.
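For example:

```python
import re

print(re.findall(r'\D', '867-5309'))   # ['-'] -- the only non-digit character
print(re.findall(r'\S', ' a b '))      # ['a', 'b'] -- whitespace is skipped
print(re.search(r'\W', 'no_symbols'))  # None -- every character is a word character
```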
There is also a negated counterpart for \b, but it works slightly differently. Whereas \b matches a zero-length substring at the beginning or end of a word, \B matches a zero-length substring that is not at the beginning or end of a word. This essentially reverses the corn and corner example from earlier.
>>> re.search(r'corn\B', 'corner')
<_sre.SRE_Match object; span=(0, 4), match='corn'>
>>> re.search(r'corn\B', 'corn')
>>>
Two special characters designate the beginning of a string and end of a string.
The ^ character designates the beginning of a string, as shown here:
>>> re.search(r'^Python', 'This code is in Python.')
>>> re.search(r'^Python', 'Python 3')
<_sre.SRE_Match object; span=(0, 6), match='Python'>
Notice that the first command fails to produce a match. This is because the string does not start with the word Python, and the ^ character requires that the regular expression match against the beginning of the string.
Similarly, the $ character designates the end of a string, as shown here:
>>> re.search(r'fox$', 'The quick brown fox jumped over the lazy dogs.')
>>> re.search(r'fox$', 'The quick brown fox')
<_sre.SRE_Match object; span=(16, 19), match='fox'>
Again, notice that the first command fails to produce a match, because although the word fox appears, it is not at the end of the string, as the $ character requires.
The . character is the final shortcut character. It stands in for any single character. However, it only serves this role outside a bracketed character class.
Consider the following simple regex using the . character:
>>> re.search(r'p.th.n', 'python 3')
<_sre.SRE_Match object; span=(0, 6), match='python'>
>>> re.search(r'p..hon', 'python 3')
<_sre.SRE_Match object; span=(0, 6), match='python'>
In each of these cases, the period stands in for one single character. In the first example, the regular expression engine finds the . character in the regular expression. In the string, it sees a y, matches it, and continues to the next character (a t against a t).
In the second case, the same fundamental thing is happening. Each period matches one and exactly one character. The two periods match the y and the t, which consumes both of them, and the regular expression engine continues to the next character (this time, an h against an h).
Note that there is one character that the . does not match: the newline (\n). It is possible to make the . character match newline, however, which is discussed later in this chapter.
Thus far, all of the regular expressions you have seen have involved a 1:1 correlation between characters in the regular expression itself and characters in the string being searched.
Sometimes, however, a character may be optional. Consider again the example of a word with more than one correct spelling, but this time, the inclusion of a letter is what separates the two spellings, such as “color” and “colour,” or “honor” and “honour.”
You can specify a character, character class, or other atomic unit within a regular expression as optional by using the ? character, which means that the regular expression engine will expect the token to occur either zero times or once.
For example, you can match the word "honor" with its British spelling "honour" by using the regular expression honou?r.
>>> import re
>>> re.search(r'honou?r', 'He served with honor and distinction.')
<_sre.SRE_Match object; span=(15, 20), match='honor'>
>>> re.search(r'honou?r', 'He served with honour and distinction.')
<_sre.SRE_Match object; span=(15, 21), match='honour'>
In both cases, the regular expression contains four literal characters, hono. These match the hono in both honor and honour. The next thing that the regular expression hits is an optional u. In the first case, the u is absent, but this is okay because the regular expression marks it as optional. In the second case, the u is present, which is also okay. In both cases, the regular expression then seeks a literal r character, which it finds, therefore completing the match.
Thus far, you have learned only about characters (or character classes) that occur once and exactly once, or that are entirely optional (occurring zero times or once). However, sometimes you need the same character or character class to repeat.
You may expect a character class to recur a set number of consecutive times, such as in a phone number. American phone numbers comprise the country code 1 (often omitted), a three-digit area code, and a seven-digit local number, with the third and fourth digits of the latter separated by a hyphen, period, or similar.
You can designate that a token must repeat a given number of times with {N}, where N is the number of times the token should repeat.
The following uses a regular expression to identify a seven-digit local phone number (ignore the country code and area code for the moment): [\d]{3}-[\d]{4}.
>>> re.search(r'[\d]{3}-[\d]{4}', '867-5309 / Jenny')
<_sre.SRE_Match object; span=(0, 8), match='867-5309'>
In this case, the regular expression engine starts by looking for three consecutive digits. It finds them (867), and then moves on to the literal hyphen character. Because this hyphen is not within a character class, it carries no special meaning and simply matches a literal hyphen. The regular expression then finds the final four consecutive digits (5309) and returns the match.
Sometimes, you may not know exactly how many times the token ought to repeat. Phone numbers may contain a static number of digits, but lots of numeric data is not standardized this way.
For example, consider credit card security codes. Credit cards issued in the United States contain a special security code on the back, often called a "CVV code." Most credit card brands use three-digit security codes, which you can match with [\d]{3}. However, American Express uses four-digit security codes ([\d]{4}).
What if you want to be able to match both of these cases? Repetition ranges come in handy here. The syntax is {M,N}, where M is the lower bound and N is the upper bound.
It is worth noting here that the bounds are inclusive. If you want to match three digits or four digits, the correct syntax is [\d]{3,4}. You might be tempted (based on using Python slices) to believe that the upper bound is exclusive (and that you should use {3,5} instead). However, regular expressions do not work this way.
>>> re.search(r'[\d]{3,4}', '0421')
<_sre.SRE_Match object; span=(0, 4), match='0421'>
>>> re.search(r'[\d]{3,4}', '615')
<_sre.SRE_Match object; span=(0, 3), match='615'>
In both cases, the regular expression engine finds a series of digits that matches what it expects, and returns a match.
When given the choice to match three characters or four characters, where either is a valid match, how does the regular expression engine decide? The answer is that, under most circumstances, the regular expression engine is “greedy,” meaning that it will match as many characters as possible for as long as it can. In this simple case, that means that if there are four digits, four digits will be matched.
Occasionally, this behavior is undesirable. Placing a ? character immediately after the repetition operator makes that repetition "lazy," meaning that the engine will match as few characters as possible while still returning a valid match.
>>> re.search(r'[\d]{3,4}?', '0421')
<_sre.SRE_Match object; span=(0, 3), match='042'>
The re-use of the ? character for another purpose does not cause any ambiguity for the parser, because here the character comes after repetition syntax, rather than after a token to be matched against.
Note that the ? in this situation does not serve to make the repeated segment optional. It simply means that, given the opportunity to match three or four digits, the engine will elect to match only three.
You also may encounter cases where there is no upper bound for the number of times that a token may repeat. For example, consider a traditional street address. This usually starts with a number (for the moment, hand-wave the exceptions and assert that they always do), but the number could be any arbitrary length. There is nothing technically invalid about an eight-digit street number.
In these cases, you can leave off the upper bound, but retain the , character to designate that the upper bound is ∞. For example, {1,} designates one or more occurrences with no upper bound.
>>> re.search(r'[\d]{1,}', '1600 Pennsylvania Ave.')
<_sre.SRE_Match object; span=(0, 4), match='1600'>
This syntax also works if you do not want to specify a lower bound, in which case, the lower bound is assumed to be 0.
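For example, {,3} is equivalent to {0,3}, so a match can succeed even when the repeated token does not appear at all:

```python
import re

print(re.search(r'x[\d]{,3}', 'x42!').group())  # x42
print(re.search(r'x[\d]{,3}', 'x!').group())    # x -- zero digits still match
```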
You can use two shorthand characters in designating common repetition situations. You can use the + character in lieu of specifying {1,} (one or more). Similarly, you can use the * character in lieu of specifying {0,} (zero or more).
Therefore, the previous example could be rewritten using +, as shown here:
>>> re.search(r'[\d]+', '1600 Pennsylvania Ave.')
<_sre.SRE_Match object; span=(0, 4), match='1600'>
Using + and * generally makes for a regular expression that is easier to read, and they are the preferred syntax in cases where they are applicable.
Regular expressions provide a mechanism to split the expression into groups. When using groups, you are able to select each individual group within the match in addition to getting the entire match. You can specify groups within a regular expression by using parentheses.
The following is an example of a simple, local phone number. However, this time, each set of digits is a group.
>>> match = re.search(r'([\d]{3})-([\d]{4})', '867-5309 / Jenny')
>>> match
<_sre.SRE_Match object; span=(0, 8), match='867-5309'>
As before, you can use the group method on the match object to return the entire match.
>>> match.group()
'867-5309'
The re module's match objects provide a method, groups, which returns a tuple corresponding to each individual group.
>>> match.groups()
('867', '5309')
By breaking your regular expression into subgroups like this, you can quickly get not just the entire match, but specific bits of data within the match.
It is also possible to get just a single group, by passing an argument to the group method corresponding to the group you want back (note that group numbers are 1-indexed).
>>> match.group(2)
'5309'
By using groups, you can take a phone number formatted in a variety of different ways and extract only the data that matters, which is the actual digits of a phone number.
>>> re.search(
... r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... '(213) 867-5309')
<_sre.SRE_Match object; span=(0, 14), match='(213) 867-5309'>
>>> re.search(
... r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... '213-867-5309')
<_sre.SRE_Match object; span=(0, 12), match='213-867-5309'>
>>> re.search(
... r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... '213.867.5309')
<_sre.SRE_Match object; span=(0, 12), match='213.867.5309'>
>>> re.search(
... r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... '2138675309')
<_sre.SRE_Match object; span=(0, 10), match='2138675309'>
>>> re.search(
... r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... '+1 (213) 867-5309')
<_sre.SRE_Match object; span=(0, 17), match='+1 (213) 867-5309'>
>>> re.search(
... r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... '1 (213) 867-5309')
<_sre.SRE_Match object; span=(0, 16), match='1 (213) 867-5309'>
>>> re.search(
... r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... '1-213-867-5309')
<_sre.SRE_Match object; span=(0, 14), match='1-213-867-5309'>
This regular expression is a bit more complicated than what you have encountered already. Consider each distinct part by itself, however, and it is easier to parse.
The first segment is (\+?1)?[ .-]?. This looks for the United States country code in almost any format you may encounter (+1 or 1, optionally followed by a space, period, or hyphen).
The second segment is \(?([\d]{3})\)?[ .-]?, and it grabs the area code, along with the optional separator (space, period, or hyphen) that may follow it. The area code may optionally be wrapped in parentheses (as is common with U.S. phone numbers).
The remainder of the regular expression matches the final seven digits of the phone number, and is the same as what you have already seen.
Regardless of how the phone number is formatted, the regular expression is capable of matching it. And although the full match is still formatted based on the original data provided, the groups are consistently the same.
>>> match = re.search(
... r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... '213-867-5309')
>>> match.groups()
(None, '213', '867', '5309')
>>> match = re.search(
... r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... '+1 213-867-5309')
>>> match.groups()
('+1', '213', '867', '5309')
The only difference between the groups is based on what was provided for the country code. If it is omitted, then it is not captured either, and None is provided in its place. The second through fourth groups consistently contain the three (intra-national) segments of the phone number.
Up until this point, the examples have consistently used the group method to return the entire match, rather than just a single group. In fact, it may seem like very odd nomenclature indeed to have to call the group method to get back the entire match in the first place.
Why does it work this way? The purpose of group is actually to return a single group from the match. It takes an optional argument, which is the number of the group to return. If the argument is omitted (as the examples have consistently done), it defaults to 0.
In regular expressions, the groups are counted based on their position in the regular expression, starting with 1.
The 0 group is special, and corresponds to the entire match. This is why groups are 1-indexed. By calling group with no argument, you are asking for group 0 and, therefore, getting the entire match back.
In addition to having positionally numbered groups, the Python regular expression engine also provides a mechanism for naming groups. This functionality was actually originally introduced by the Python regular expression implementation, although many other languages have picked it up at this point.
The syntax for a named group is to add ?P<group_name> immediately after the opening ( character. You could specify the local phone number regular expression to use named groups by rewriting it as (?P<first_three>[\d]{3})-(?P<last_four>[\d]{4}).
>>> match = re.search(r'(?P<first_three>[\d]{3})-(?P<last_four>[\d]{4})',
... '867-5309')
>>> match
<_sre.SRE_Match object; span=(0, 8), match='867-5309'>
First of all, note that named groups are also still positional groups. You can (if you choose) still look up the groups this way:
>>> match.groups()
('867', '5309')
>>> match.group(1)
'867'
Using named groups opens up two more ways to look up a group. First, the name of the group can be passed as a string to the group method.
>>> match.group('first_three')
'867'
Additionally, match objects provide a groupdict method. This method is similar in most ways to the groups method, except that it returns a dictionary instead of a tuple, and the dictionary keys correspond to the names of the groups.
>>> match.groupdict()
{'first_three': '867', 'last_four': '5309'}
It is worth noting that groupdict, like groups, does not return the entire match; it only returns the subgroups. Also, if you have a mix of named groups and unnamed groups, the unnamed groups are not part of the dictionary returned by groupdict.
>>> match = re.search(r'(?P<first_three>[\d]{3})-([\d]{4})', '867-5309')
>>> match.groups()
('867', '5309')
>>> match.groupdict()
{'first_three': '867'}
In this case, only the first group (named first_three) is a named group; the second group is a numbered group only. Therefore, when groups is called, both groups are returned in the tuple. However, when groupdict is called, only the first_three group is included in the result.
Named groups are quite valuable for maintenance reasons. Code often references specific groups, and if you primarily use named groups, adding a new group to the regular expression later does not require renumbering every group reference in your code, because the existing names stay the same.
The regular expression engine also provides a mechanism to reference a previously matched group. Sometimes, you may be looking for a subsequent occurrence of the same submatch.
For example, if you are trying to parse a block of XML, you may want to very permissively look for any valid opening tag, such as <([\w_-]+)>
. However, you want to ensure that the same closing tag exists.
It is insufficient to simply repeat this pattern a second time. On the one hand, it will correctly match patterns that you want.
>>> re.search(r'<([\w_-]+)>stuff</([\w_-]+)>', '<foo>stuff</foo>')
<_sre.SRE_Match object; span=(0, 16), match='<foo>stuff</foo>'>
On the other hand, it would also match patterns that should not actually match.
>>> match = re.search(r'<([\w_-]+)>stuff</([\w_-]+)>', '<foo>stuff</bar>')
>>> match
<_sre.SRE_Match object; span=(0, 16), match='<foo>stuff</bar>'>
>>> match.group(1)
'foo'
>>> match.group(2)
'bar'
Here, the regular expression engine correctly sees <foo>
as an opening XML tag, matches it, and assigns the text foo
to the subgroup. It then matches the literal characters stuff
, and then goes to match the closing XML tag.
At this point, what you intuitively want is for the match to fail, because the closing XML tag is </bar>
, which is not the same as the opening tag of <foo>
.
The regular expression engine does not do that, however. It has simply been told to match the </
and >
wrapping characters, and then word characters in between. Because bar
fulfills this requirement, the engine matches it, assigns it to the second subgroup, and returns a match.
What you really want at this point is for the regular expression engine to require the same submatch as was used in the first group. This should make a string of <foo>stuff</foo>
match, but a string of <foo>stuff</bar>
fail to match.
The regular expression engine provides a way to do this using backreferences. Backreferences refer to a previously matched group within a regular expression, and cause the regular expression parser to expect the same match text to occur again.
You backreference numbered groups using \N
, where N
is the group number. Therefore, \1
will match the first group, \2
the second group, and so on. This syntax is capable of matching up to the first 99 groups.
Consider the following XML regular expression that uses a backreference:
>>> match = re.search(r'<([\w_-]+)>stuff</\1>', '<foo>stuff</foo>')
>>> match
<_sre.SRE_Match object; span=(0, 16), match='<foo>stuff</foo>'>
>>> match.groups()
('foo',)
Notice that there is only one subgroup now. In the previous example, there were two, both containing the text foo
. In this case, however, a backreference has replaced the second group.
A much more important distinction, however, is what this regular expression does not match.
>>> re.search(r'<([\w_-]+)>stuff</\1>', '<foo>stuff</bar>')
>>>
In this case, the regular expression engine successfully matches up to the closing XML tag. However, because bar
is not the same text as foo
, the match fails.
Earlier, you learned about negated character classes, which enable you to match any character other than those in the class. As mentioned before, the character matched by a negated character class becomes part of the match, and the class will not match the absence of a character at all.
There is, however, a mechanism to accept or reject a match based on the presence or absence of content after it, without making the subsequent content part of the match. This is called lookahead.
The previous example of a negated character class was n[^e]
—an n
followed by a character that is not an e
. This matched na
in final
, failed to match anything in jasmine
, and failed to match anything in Python
.
A similar regular expression that instead uses negative lookahead would employ the syntax n(?!e)
.
>>> re.search(r'n(?!e)', 'final')
<_sre.SRE_Match object; span=(2, 3), match='n'>
>>> re.search(r'n(?!e)', 'jasmine')
>>> re.search(r'n(?!e)', 'Python')
<_sre.SRE_Match object; span=(5, 6), match='n'>
These results are slightly different from those with a negated character class. In the first example, using the word final
, the regular expression again matches, but the match is different. While the negated character class made the a
character part of the match, negative lookahead does not, and the match comes back as just the n
character.
The second result is the most similar. The n
in jasmine
matches the n
character in the regular expression. However, because the n
is followed by an e
, it is disqualified, and the match fails.
The final result is the most different, because this match actually succeeds, where it did not with a negated character class. The regular expression engine matches the n
in Python
. It then reaches the end of the string. Because that n
is not followed by an e
, the match succeeds and is returned.
It is worth noting that while this may look like group syntax, in this case, a group is not saved.
>>> match = re.search(r'n(?!e)', 'final')
>>> match
<_sre.SRE_Match object; span=(2, 3), match='n'>
>>> match.groups()
()
The regular expression engine also supports a different kind of lookahead, called a positive lookahead. This requires that the match be followed by the character or characters in question, but nonetheless does not make those characters part of the match.
The syntax for positive lookahead simply replaces the !
character with =
. Consider this regular expression:
>>> re.search(r'n(?=e)', 'jasmine')
<_sre.SRE_Match object; span=(5, 6), match='n'>
In this case, the regular expression engine matches the n
in the word jasmine
. After doing so, it verifies that the subsequent character is an e
, as the regular expression requires. Because it is, the match is complete and returned. As before, no group is created by the lookahead.
Without the e
, the match fails, as shown here:
>>> re.search(r'n(?=e)', 'jasmin')
>>>
In this case, the regular expression engine again matches the n
, but disqualifies the match because it is not followed by an e
.
Sometimes, you need to slightly tweak the behavior of the regular expression engine. The regular expression engines in most languages, including Python, offer a small number of flags that modify the behavior of the entire expression.
The Python engine offers several flags that can be sent to a regular expression when using re.search
or similar functions. In the case of re.search
, it takes a third argument for flags.
The simplest and most straightforward flag is re.IGNORECASE
, which causes the regular expression to become case-insensitive.
>>> re.search(r'python', 'PYTHON IS AWESOME', re.IGNORECASE)
<_sre.SRE_Match object; span=(0, 6), match='PYTHON'>
When using re.IGNORECASE
, the match will still be returned using the case of the string in which it was found, and not the case of the regular expression.
re.IGNORECASE
is also aliased to re.I
.
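Because the alias names the same flag value, the two forms are interchangeable; a quick check:

```python
import re

# re.I and re.IGNORECASE are the same flag value.
print(re.I == re.IGNORECASE)  # True
print(re.search(r'python', 'PYTHON IS AWESOME', re.I).group())  # 'PYTHON'
```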
You may recall that there is a difference between how some character shortcuts work between Python 2 and Python 3. For example, \w
in Python 3 matches word characters in nearly any language, rather than just the Latin alphabet.
The re
module provides flags to make Python 2 follow the Python 3 behavior, and also flags to make Python 3 follow the Python 2 behavior.
The re.UNICODE
(aliased to re.U
) flag forces the regular expression engine to follow the Python 3 behavior. This flag is defined in both Python 2 and Python 3, so it is safe to use it in code designed to run on either platform. Note that if you try to use a byte string with re.U
in Python 3, the parser will raise an exception.
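A brief sketch of that failure mode on Python 3 (the exact exception message may vary between versions):

```python
import re

# On Python 3, a bytes pattern cannot carry the UNICODE flag.
try:
    re.search(rb'\w+', b'hello', re.UNICODE)
except ValueError as exc:
    print('rejected:', exc)
```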
The re.ASCII
(aliased to re.A
) flag forces the regular expression to follow the Python 2 behavior. Unlike re.UNICODE
, the re.ASCII
flag is not available in Python 2. If you need re.ASCII
in code that runs under both Python 2 and Python 3, use the appropriate character classes instead, or do a version check before applying the flag.
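One way to sketch such a version check, assuming Python 2's default ASCII-only behavior for str patterns makes a flag unnecessary on that side:

```python
import re
import sys

# Apply re.ASCII only where it exists (Python 3); on Python 2, \w is
# already ASCII-only for str patterns unless re.UNICODE is passed.
flags = re.ASCII if sys.version_info[0] >= 3 else 0

# With ASCII behavior, \w stops at the first non-ASCII character.
print(re.search(r'\w+', 'naïve', flags).group())  # 'na'
```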
The re.DOTALL
flag (aliased to re.S
to match the terminology used in Perl and elsewhere) causes the .
character to match newline characters in addition to all other characters.
>>> re.search(r'.+', 'foo\nbar')
<_sre.SRE_Match object; span=(0, 3), match='foo'>
>>> re.search(r'.+', 'foo\nbar', re.DOTALL)
<_sre.SRE_Match object; span=(0, 7), match='foo\nbar'>
In the first command, the regular expression engine must match one or more of any character. It matches foo
, and then it reaches a line break and stops, because .
does not normally match line breaks.
However, in the second command, re.DOTALL
is passed, and the line break character is included in what .
matches against. Therefore, the regular expression engine (being greedy) keeps going until it reaches end of string, and the entire string is returned as the match.
The re.MULTILINE
flag (aliased to re.M
) causes the ^
and $
characters, which normally would only match against the beginning or end of the string (respectively), to instead match against the beginning or end of any line within the string.
>>> re.search(r'^bar', 'foo\nbar')
>>> re.search(r'^bar', 'foo\nbar', re.MULTILINE)
<_sre.SRE_Match object; span=(4, 7), match='bar'>
In the first command, the ^
character is only able to match against the beginning of the string. Therefore, the word bar
does not match, because it is not the first thing in the string.
In the second command, however, the re.MULTILINE
flag is used. Therefore, the ^
character merely requires the beginning of a line. Because a newline character immediately precedes bar
, it matches and the match is returned.
The re.VERBOSE
flag (aliased to re.X
) allows for complicated regular expressions to be expressed in a more readable way.
This flag does two things. First, it causes all whitespace (other than in character classes) to be ignored, including line breaks. Second, it treats the #
character (again, unless it's inside a character class) as a comment character.
This allows for easy annotation of regular expressions, which can be valuable as they become complicated. The following two commands are equivalent:
>>> re.search(r'(?P<first_three>[\d]{3})-(?P<last_four>[\d]{4})', '867-5309')
<_sre.SRE_Match object; span=(0, 8), match='867-5309'>
>>> re.search(r"""(?P<first_three>[\d]{3}) # The first three digits
... - # A literal hyphen
... (?P<last_four>[\d]{4}) # The last four digits
... """, '867-5309', re.VERBOSE)
<_sre.SRE_Match object; span=(0, 8), match='867-5309'>
The re.DEBUG
flag (not aliased) dumps some debugging information to standard output while compiling a regular expression.
>>> re.search(r'(?P<first_three>[\d]{3})-(?P<last_four>[\d]{4})',
...           '867-5309', re.DEBUG)
subpattern 1
max_repeat 3 3
in
category category_digit
literal 45
subpattern 2
max_repeat 4 4
in
category category_digit
<_sre.SRE_Match object; span=(0, 8), match='867-5309'>
Occasionally, you may need to use more than one of these flags at once. To do this, join them with the |
(bitwise OR) operator. For example, if you need both the re.DOTALL
and re.MULTILINE
flags, the correct syntax is re.DOTALL | re.MULTILINE
or re.S | re.M
.
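A short sketch combining both flags: re.MULTILINE lets ^ anchor at the start of an interior line, while re.DOTALL lets .+ cross the newline that follows.

```python
import re

text = 'foo\nbar\nbaz'

# Neither flag alone is enough here: ^ must anchor at the second
# line (MULTILINE), and .+ must cross a newline (DOTALL).
match = re.search(r'^bar.+baz', text, re.DOTALL | re.MULTILINE)
print(match.group())  # 'bar\nbaz'

# Without the flags, the same pattern fails to match at all.
print(re.search(r'^bar.+baz', text))  # None
```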
It is also possible to use flags within a regular expression itself by beginning the regular expression with special syntax. This uses the short-form flag, and looks like this:
>>> re.search('(?i)FOO', 'foo').group()
'foo'
Note the (?i)
at the beginning. This is the equivalent of using the re.IGNORECASE
flag. However, passing flags explicitly is usually preferable to this inline syntax. Also, the long form of the flags will not work; (?ignorecase)
is not valid and will raise an exception.
The regular expression engine is not limited to simply identifying whether a pattern exists within a string. It is also capable of performing string replacement, returning a new string based on the groups in the original one.
The substitution method in Python is re.sub
. It takes three arguments: the regular expression, the replacement string, and the source string being searched. Only the actual match is replaced, so if there is no match, re.sub
ends up being a no-op.
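The no-op behavior is easy to see side by side, in a minimal sketch with a made-up pattern:

```python
import re

# Only the matched portion of the string is replaced.
print(re.sub(r'[\d]{3}-[\d]{4}', 'XXX-XXXX', 'Call 867-5309 today.'))
# 'Call XXX-XXXX today.'

# With no match, the original string comes back unchanged.
print(re.sub(r'[\d]{3}-[\d]{4}', 'XXX-XXXX', 'No number here.'))
# 'No number here.'
```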
re.sub
enables you to use the same backreferences from regular expression patterns within the replacement string. Consider the task of stripping irrelevant formatting data from a phone number:
>>> re.sub(r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... r'\2\3\4',
... '213-867-5309')
'2138675309'
Because this regular expression matches nearly any phone number and groups only the actual digits of the phone number, you will get back the same data regardless of how the original number was formatted.
>>> re.sub(r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... r'\2\3\4',
... '213.867.5309')
'2138675309'
>>> re.sub(r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... r'\2\3\4',
... '2138675309')
'2138675309'
>>> re.sub(r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... r'\2\3\4',
... '(213) 867-5309')
'2138675309'
>>> re.sub(r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... r'\2\3\4',
... '1 (213) 867-5309')
'2138675309'
>>> re.sub(r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... r'\2\3\4',
... '+1 213-867-5309')
'2138675309'
The replacement string is not limited to just using the backreferences from the string; other characters are interpreted literally. Therefore, re.sub
can also be used for formatting. For example, what if you want to display a phone number rather than store it, but you want to display it in a consistent format? re.sub
can handle that, as shown here:
>>> re.sub(r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})',
... r'(\2) \3-\4',
... '+1 213-867-5309')
'(213) 867-5309'
Everything here is the same as in the previous examples, except for the replacement string, which has gained the parentheses, space, and hyphen. Therefore, so has the result.
One final feature of Python's regular expression implementation is compiled regular expressions. The re
module contains a function, compile
, which returns a compiled regular expression object, which can then be reused.
The re
module caches regular expressions that it compiles on the fly, so in most situations, there is no substantial performance advantage to using compile
. It can be extremely useful for passing regular expression objects around, however.
The re.compile
function returns a regular expression object, with the compiled regular expression as data. These objects have their own search
and sub
methods, which omit the first argument (the regular expression itself).
>>> regex = re.compile(
... r'(\+?1)?[ .-]?\(?([\d]{3})\)?[ .-]?([\d]{3})[ .-]?([\d]{4})'
... )
>>> regex.search('213-867-5309')
<_sre.SRE_Match object; span=(0, 12), match='213-867-5309'>
>>> regex.sub(r'(\2) \3-\4', '+1 213.867.5309')
'(213) 867-5309'
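The "passing regular expression objects around" point can be sketched with a small helper. The function below is hypothetical, not part of the re module; it accepts any compiled pattern, letting callers decide what counts as a match.

```python
import re

def extract_first(pattern, text):
    """Return the first match of a compiled pattern, or None."""
    match = pattern.search(text)
    return match.group() if match else None

local_number = re.compile(r'[\d]{3}-[\d]{4}')
print(extract_first(local_number, 'Jenny: 867-5309'))  # '867-5309'
print(extract_first(local_number, 'no phone number'))  # None
```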
Also, there is one other advantage to using re.compile
. The search
method of regular expression objects actually allows for two additional arguments not available on re.search
. These are the starting and ending positions of the string to be searched against, enabling you to exempt some of the string from consideration.
>>> regex = re.compile(r'[\d]+')
>>> regex.search('1 mile is equal to 5280 feet.')
<_sre.SRE_Match object; span=(0, 1), match='1'>
>>> regex.search('1 mile is equal to 5280 feet.', pos=2)
<_sre.SRE_Match object; span=(19, 23), match='5280'>
The values sent are available as the pos
and endpos
attributes on the match objects returned.
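A quick illustration of those attributes:

```python
import re

regex = re.compile(r'[\d]+')
match = regex.search('1 mile is equal to 5280 feet.', pos=2)

# The bounds used for the search are recorded on the match object.
print(match.pos)      # 2
print(match.endpos)   # 29 (the full length of the string, the default)
print(match.group())  # '5280'
```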
Regular expressions are extremely useful tools for finding, parsing, and validating data. They often look intimidating to those who have not used them before, but they are manageable if taken piece by piece.
In addition, mastering regular expressions will enable you to perform parsing and formatting tasks that are much more difficult without a pattern-matching algorithm.
However, be wary of using regular expressions when they are unnecessary. Sometimes, using a few lines of code with direct string comparison is much more straightforward. Like any tool, regular expressions should be used when they are the appropriate solution, but not when simpler approaches are available to you.
Similarly, bear in mind that regular expressions are often unsuitable for parsing extremely complex structures. If you are parsing a non-trivial document format, you should probably be looking for another library that handles that for you.
Chapter 10 examines testing applications in Python.